Welcome to Assignment 3, Data Exploration

Instructions:

Below we’ve transcribed the assignment instructions with corresponding code blocks beneath each.

🚨Please note,

  1. Load into R the expression data and matching metadata that you processed in Assignment 2.

Please insert your data files in the data directory such that the single folder from refine.bio containing the metadata TSV and main TSV file are present in the immediate subdirectory. This can be downloaded from google drive, here!

This next step loads data and annotation libraries per the instructions, but does not attempt to transform our Gene column to the Entrez IDs as this was unnecessary for downstream processing.

# Check that files exist and are usable.
file.exists(data_file)
## [1] TRUE
file.exists(metadata_file)
## [1] TRUE

Below we read in the data but we also take additional measures to trim out ambiguously-labeled samples to give clusters a better chance at lining up with our diseased/not-diseased groupings in the final statistics step.

  1. Unsupervised Analysis
    1. Subset your data to the 5,000 most variable genes
    2. Using that subset of the data, select and run a clustering algorithm from this list:
      1. K-means
      2. Hierarchical clustering (hclust)
      3. ConsensusClusterPlus
      4. PAM Clustering
      5. Gaussian Mixture Models
    3. Each student in your team should run a different clustering method (e.g., if there are 4 students
      on the team, there should be results for 4 clustering methods in your assignment writeup).
    4. How many clusters did each method find?
      1. If you ran a method that requires you to select the number of clusters (k), how did changing this value change the results? Compare cluster membership at each k to investigate this.
      2. If you ran a method that selects k for you, describe why it chose that k.
    5. Rerun each clustering method using different numbers of genes. Try 10, 100, 1000, and 10000 genes.
      1. How did the number of genes affect clustering?
      2. Create an alluvial diagram (Sankey plot) to visualize how the different clustering setups
        changed cluster memberships for each sample. (https://cran.rproject.org/web/packages/ggalluvial/vignettes/ggalluvial.html)

1A: Subsetting to Most Variable Genes

1B:

Rayyan - K-means

# visualize ideal K for K-means clustering using the elbow method on within-sum-squares metric

if(!require("purrr"))
  install.packages("purrr")

if(!require('factoextra'))
  install.packages('factoextra')

set.seed(123)
#Expect a delay, takes a bit
fviz_nbclust(top_5000, kmeans, k.max = 6, method = "wss")

# using the elbow method, the optimal number of clusters appears to be ~3

The elbow method visualized above reruns K-Means with different amounts of clusters and visualizes how “Total within sum of square” decreases. This “within sum of square” metric explains proximity of points to their centroid. Lower values mean tighter groupings. The elbow method means we look for where the elbow in the curve occurs, picking a number of clusters K that does not overfit the data, but also provides a significant reduction in WSS compared to K-1 clusters. Here we choose 3 as optimal.

#about 20 seconds
kmeans_result_5000_three <- kmeans(top_5000,3,iter.max = 10, nstart = 25)
kmeans_result_5000_four <- kmeans(top_5000,4,iter.max = 10, nstart = 25)
kmeans_result_5000_five <- kmeans(top_5000,5,iter.max = 10, nstart = 25)
#About 30 seconds
fviz_cluster(kmeans_result_5000_three, data = top_5000, geom="point")

Above we can see how these 3 clusters look when displayed in a 2D graph via PCA. This is a healthy “sanity-check” indicating our choice to use four clusters (per elbow method) is indeed acceptable. We do note that the graph does appear to show overlap between these three c

If you ran a method that requires you to select the number of clusters (k), how did changing this value change the results? Compare cluster membership at each k to investigate this.

The K-means clustering method required me to select the number of clusters. To determine the ideal “k”, I utilized the “factoextra” package to generate a graph showing the number of clusters versus the total within the sum of square. Using the elbow method, I determined that the optimal number of clusters was 2, and to test how changing the value changed the results, ran the “kmeans” clustering command for a k of 3,4, and 5 to determine how changing k changed the results.

Cluster membership at 3, 4, and 5 was:

3 - 843, 493, 265

4 - 591, 491, 308, 211

5 - 491, 370, 299, 243, 198,

At a k of 3, reduction in sum of squares was 4.9%, and at a k of 4, it rose only slightly to 5.5 %. At a k of 5, this rises only slightly to 5.9 %. These results confirm the results of the fviz_nbclust command, showing that there are diminishing returns in reduction in sum of squares as the number of clusters increases. By looking at cluster membership at k = 3,4, and 5 we can see that cluster membership does not change drastically between 3 and 4, as it is the largest cluster that ends up splitting to form the fourth cluster. Meanwhile, between a k of 4 and 5, the new cluster takes away members from other clusters, not just the largest.

# rerun K-Means clustering method with different number of genes AND K values for the alluvial down the line
kmeans_result_10_three <- kmeans(top_10,3,iter.max = 10,nstart=25)
kmeans_result_10_four <- kmeans(top_10,4,iter.max = 10,nstart=25)
kmeans_result_10_five <- kmeans(top_10,5,iter.max = 10,nstart=25)
kmeans_result_100_three <- kmeans(top_100,3,iter.max = 10,nstart=25)
kmeans_result_100_four <- kmeans(top_100,4,iter.max = 10,nstart=25)
kmeans_result_100_five <- kmeans(top_100,5,iter.max = 10,nstart=25)
kmeans_result_1000_three <- kmeans(top_1000,3,iter.max = 10,nstart=25)
kmeans_result_1000_four <- kmeans(top_1000,4,iter.max = 10,nstart=25)
kmeans_result_1000_five <- kmeans(top_1000,5,iter.max = 10,nstart=25)
kmeans_result_10000_three <- kmeans(top_10000,3,iter.max = 10,nstart=25)
kmeans_result_10000_four <- kmeans(top_10000,4,iter.max = 10,nstart=25)
kmeans_result_10000_five <- kmeans(top_10000,5,iter.max = 10,nstart=25)
#print(kmeans_result_10000_three)

How did the number of genes affect clustering?

Cluster Membership at 10, 100, 1000, and 10,000 (with a k of 3) was:

10 - 612, 520, 469

100 - 760, 474, 367

1000 - 797, 500, 304

10,000 - 852, 492, 257

Looking solely at cluster membership, it appears that each time the number of genes increased, membership in the largest cluster steadily increased while membership in the other clusters varyingly decreased and increased. As the number of genes increased, a reduction in sum of squares at a k of 3 was calculated as 34.8 %, 11.9 %, 7.5 %, and 4.2 % at 10, 100, 1,000, and 10,000, respectively. As the number of genes increased, the reduction in sum of squares greatly decreased as well.

alluvialData <- data.frame(
  K3 = kmeans_result_10_three$cluster, 
  K4 = kmeans_result_10_four$cluster, 
  K5 = kmeans_result_10_five$cluster
)
ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 10 Samples at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

The first of the alluvial plots here shows only 10 genes worth of data. We can see how the samples in cluster 1 bifurcate early, while the samples of cluster 2 at K=4 column bifurcate into groups 3 and 5 at K=5. Nothing overwhelmingly insightful here, but lets see what happens with more genes.

alluvialData <- data.frame(
  K3 = kmeans_result_100_three$cluster, 
  K4 = kmeans_result_100_four$cluster, 
  K5 = kmeans_result_100_five$cluster
)
ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 100 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

Things get more complicated with more genes. We can see the pattern of bifurcations appears to be significantly different although of some intrigue is the relatively robust group 2 chosen at K=3 (first column) which seems to barely bifurcate at all (a small fraction goes to group 2 at K=5, last column). Perhaps this group is the mostly tight, whereas clearly the grey bands of group 1 from K3 (again, first column) are going many places indicating it was a poor discriminant and likely of lesser significance.

alluvialData <- data.frame(
  K3 = kmeans_result_1000_three$cluster, 
  K4 = kmeans_result_1000_four$cluster, 
  K5 = kmeans_result_1000_five$cluster
)
ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 1000 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

Again the pattern changes dramatically. This is unsurprising given we are adding more highly variable genes to the mix. Little more is to be said.

alluvialData <- data.frame(
  K3 = kmeans_result_10000_three$cluster, 
  K4 = kmeans_result_10000_four$cluster, 
  K5 = kmeans_result_10000_five$cluster
)
ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 10,000 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

This last diagram is perhaps the most significant in examining groups since it is based on the most data. Cluster 2(K3) is exceptional here and bifurcates minimally indicating a high “goodness of clustering”.

Sahas - Hierarchical Clustering (hclust)

Note, ~1 minute runtime on Macbook Air M2 (known for CPU), expect delay.

# 5000 most variable genes Hierarchical Clustering
distance <- dist(top_5000, method = "euclidean")
hcluster  <- hclust(distance, method = "complete")
# 1 cluster
hcluster_result_5000_two <- cutree(hcluster, k = 2)
# 2 clusters
hcluster_result_5000_three <- cutree(hcluster, k = 3)
# 3 clusters 
hcluster_result_5000_four <- cutree(hcluster, k = 4)

# Visualize dendrogram
h <- plot(hcluster, cex = 0.8, hang = -1)

#rect.hclust(hcluster, k = 2, border = 2:5)

# If you ran a method that requires you to select the number of clusters (k), how did changing # this value change the results? Compare cluster membership at each k to investigate this.

# Visualize clusters
fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_two), geom = "point")

fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_three), geom = "point")

fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_four), geom = "point")

This set of graphs is quite interesting. The two cluster graph amusingly displays what appears to be a single cluster, likely another folly of the 2D world. The three cluster graph shows how the larger visible cluster has bifurcated (this time quite literally given the nature of hierarchical clustering). The four cluster graph further divides these two larger clusters, though it is unclear from which the points of cluster 1 originate, but the extremely tight cluster (now cluster 4) remains centered and nearly invisible.

Expect further delay (~1 mins?)

# HClust Top 10 fails to work here, as there is presumably not enough data? The origin of the issue is unclear but the dendrogram is deemed invalid.

# 10 most variable genes Hierarchical Clustering
#distance <- dist(top_10, method = "euclidean")
#hclusterTop10  <- hclust(distance, method = "complete")
#plot(hclusterTop10, cex = 0.8, hang = -1, labels = TRUE)
#hcluster_result_10_two <- cutree(hclusterTop10, k = 2)

# 100 most variable genes Hierarchical Clustering
distance <- dist(top_100, method = "euclidean")
hclusterTop100  <- hclust(distance, method = "complete")
plot(hclusterTop100, cex = 0.8, hang = -1, labels = FALSE)

hcluster_result_100_two <- cutree(hclusterTop100, k = 2)

# 1000 most variable genes Hierarchical Clustering
distance <- dist(top_1000, method = "euclidean")
hclusterTop1000  <- hclust(distance, method = "complete")
plot(hclusterTop1000, cex = 0.8, hang = -1, labels = FALSE)

hcluster_result_1000_two <- cutree(hclusterTop1000, k = 2)

# 10000 most variable genes Hierarchical Clustering
distance <- dist(top_10000, method = "euclidean")
hclusterTop10000  <- hclust(distance, method = "complete")
plot(hclusterTop10000, cex = 0.8, hang = -1, labels = FALSE)

hcluster_result_10000_two <- cutree(hclusterTop10000, k = 2)

The dendrograms above help to illustrate the tree being produced by running variations of hierarchical clustering. Of interest is the very large offshoot of the majority of the genes segregated away from smaller groups in the 10,000 gene dendrogram. This will have consequences on the alluvial visuals below, and raises questions about logarithmically scaling the width of the bands alluvial diagrams produce, rather than linearly.

Due to the nature of hierarchical clustering making the alluvial diagrams exceptionally disinteresting, we’ve opted to highlight only alluvials at 10 genes and 10,000 genes but with more divisions (you’re more or less looking at a horizontal dendrogram…)

hcluster_result_100_two <- cutree(hclusterTop100, k = 2)
hcluster_result_100_three <- cutree(hclusterTop100, k = 3)
hcluster_result_100_four <- cutree(hclusterTop100, k = 4)
hcluster_result_100_five <- cutree(hclusterTop100, k = 5)
hcluster_result_10000_two <- cutree(hclusterTop10000, k = 2)
hcluster_result_10000_three <- cutree(hclusterTop10000, k = 3)
hcluster_result_10000_four <- cutree(hclusterTop10000, k = 4)
hcluster_result_10000_five <- cutree(hclusterTop10000, k = 5)

alluvial_data <- data.frame(
  K2 = hcluster_result_100_two,
  K3 = hcluster_result_100_three,
  K4 = hcluster_result_100_four,
  K5 = hcluster_result_100_five
)

ggplot(data = alluvial_data, aes(axis1 = K2, axis2 = K3, axis3 = K4, axis4 = K5)) +
  geom_alluvium(aes(fill = K2)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "Hierarchical Clustering across different values of K at 10 genes")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

alluvial_data <- data.frame(
  K2 = hcluster_result_10000_two,
  K3 = hcluster_result_10000_three,
  K4 = hcluster_result_10000_four,
  K5 = hcluster_result_10000_five
)

ggplot(data = alluvial_data, aes(axis1 = K2, axis2 = K3, axis3 = K4, axis4 = K5)) +
  geom_alluvium(aes(fill = K2)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "Hierarchical Clustering across different values of K at 10000 genes")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

There are two diagrams here, the first showing almost nothing of interest because as per the nature of hierarchical clustering, groups are perfectly divided and do not allow samples to cross over to other branches halfway down the tree. Thus from K2-->3 to 3, 2 bifurcates to 2/3, from K3-->4, 1 bifurcates to 1 and 2, and so on. Very simple. As for the second alluvial…

You’re probably thinking “that doesn’t look right”. Au contraire. If we look at the dendrogram for 10,000 genes we see how the ensuing splits in the hierarchy take place on the right side with one giant arm reaching to the left This is the “cluster 2” which is renamed to cluster 4 on the final column by chance, but represents this large undivided branch of the tree that is only subdivided later in the hierarchy.

Michael - PAM Clustering

library(cluster)

#Center data around median gene expressions in each row, without touching OG data. This came from the ConsensusClusterPlus guide, though "Scale" which the internet recommended for use with PAM would accomplish the same thing just with means instead of medians.
top_10m = sweep(top_10,1, apply(top_10,1,median,na.rm=T))
top_100m = sweep(top_100,1, apply(top_100,1,median,na.rm=T))
top_1000m = sweep(top_1000,1, apply(top_1000,1,median,na.rm=T))
top_5000m = sweep(top_5000,1, apply(top_5000,1,median,na.rm=T))
top_10000m = sweep(top_10000,1, apply(top_10000,1,median,na.rm=T))

#Takes quite a while at 10,000
k <- 3
pam_result_10_three <- pam(top_10m, k = k)
pam_result_100_three <- pam(top_100m, k = k)
pam_result_1000_three <- pam(top_1000m, k = k)
pam_result_5000_three <- pam(top_5000m, k = k)
pam_result_10000_three <- pam(top_10000m, k = k)

🚨The instructions linked to PAM suggested the use of a dissimilarity matrix but I couldnt get fviz and other operations downstream to work with it, so im just doing PAM the classic way (it generates the dissimilarity matrix). I suspect this is why the run time is so crippling. Please consider enjoying some popcorn or perhaps painting the nearest surface before hitting run on this next cell, so that you may enjoy some semblance of stimulation. Your RAM usage graph may also be amusing to watch.

#Let's compare K values:

cluster_counts <- table(pam_result_5000_three$clustering)
cluster_counts
## 
##   1   2   3 
## 469 291 293
pam_result_5000_four <- pam(top_5000m, k = 4)
cluster_counts <- table(pam_result_5000_four$clustering)
cluster_counts
## 
##   1   2   3   4 
## 388 285  97 283
pam_result_5000_five <- pam(top_5000m, k = 5)
cluster_counts <- table(pam_result_5000_five$clustering)
cluster_counts
## 
##   1   2   3   4   5 
## 338 101 236  95 283
pam_result_5000_six <- pam(top_5000m, k = 6)
cluster_counts <- table(pam_result_5000_six$clustering)
cluster_counts
## 
##   1   2   3   4   5   6 
## 335 101 236  92  80 209
fviz_cluster(pam_result_5000_three, geom = "point")

fviz_cluster(pam_result_5000_four,  geom = "point")

fviz_cluster(pam_result_5000_five, geom = "point")

fviz_cluster(pam_result_5000_six, geom = "point")

We can see how membership changes above, at least in the two chosen dimensions for the visualization (since we’re operating on 5000-dimensional data). Indeed it seems that more clusters reduce the overlap, but the extent to which they reduce the overlap is better demonstrated through the sillhouette method shown below, because, amusingly, the new clusters introduced in the 2D field above show literally 0 reduction in overlap. 4 of the generated clusters overlap almost perfectly with eachother, implying the PCA has failed to adequately visualize these groups.

🚨Note, takes ~4 minutes on macbook air M2.

#Takes around 4 minutes on a macbook air M2
library(factoextra)
fviz_nbclust(top_5000m, pam, k.max = 6, verbose = TRUE, print.summary = TRUE, method = "silhouette")

#should identify 2 as optimal K value, which is a function of sillhouette score

The sillhouette method used above is looking at all the points resulting from different K values of PAM, giving them a score (based on their proximity to their assigned cluster and distance to other clusters), and then averaging the score accross all points to determine a total score that represents discrimination. We’d expect this to increase with more clusters but alas it seems 1 cluster is locally optimal in effectively segregating our samples. This is pretty odd, but, the numbers dont seem to lie but the ensuing sillhouette scores aren’t terribly worse, so I’ve chosen to show k=5 below.

Below I visualize the sillhouette score. We can see the samples in the two clusters that overlap, they are relatively few (the part of the red and blue chunk below 0). The dotted red line represents the average score, which is excellent. The X axis is individual samples though I’ve removed the label to keep things clean!

pam_result_5000_five <- pam(top_5000m, k = 2)
fviz_silhouette(pam_result_5000_five, label=FALSE)
##   cluster size ave.sil.width
## 1       1  760          0.04
## 2       2  293          0.05

This sillhouette chart is hillarious. The loss in sillhouette score comes entirely from the red group which refuses to be separated by PAM whilst the other groups, especially pink (5), are exceptionally segregated with values almost exclusively in the positive domain. Perhaps if more clusters were added, the global or at least a new local optima K value of, say, 15 or 20 would be found, but for now monolithic cluster 1 is dragging the score down.

Finally lets look at an alluvial. Since 10,000 for PAM is very computationally expensive, I will instead show an alluvial at 5000 for many K values as we already computed these above.

top_5000G <- top_5000 %>%
  tibble::rownames_to_column("Gene")
alluvialData <- data.frame(
  K3 = pam_result_5000_three$clustering, 
  K4 = pam_result_5000_four$clustering,
  K5 = pam_result_5000_five$clustering,
  K6 = pam_result_5000_six$clustering
)

ggplot(
  data = alluvialData, 
  aes(axis1 = K3, axis2 = K4, axis3 = K5, axis4 = K6, axis5 = K6)
) +
  geom_alluvium(aes(fill = K6)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_classic() +
  xlab("Amount of Clusters") +
  ylab("Gene Index") +
  labs(title = "PAM for 5000 Samples at K=3,4,5,6")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

Nasty. We can see the monolithic cluster squeezing the other groups away, and making me ponder if a logarithimically scaled cluster size visualization would be preferable. Alas.

This speaks both to the significant goodness-of-cluster that this monolith is exhibiting and to the robust cluster 2 which also undergoes minimal subdivision in higher K values.

A lesson here is that the marginalized groups in an alluvial are often intriguing, making it a poor visualization for those interesting outliers (though they capture macroscopic trends that are still surely significant and more likely to be relevant to the disease or affliction we’re usually trying to capture in bioinformatics).

  1. Heatmaps and Dendrograms

    1. Create a heatmap of the 5,000 genes used in clustering, with an annotation sidebar showing one set of clusters you identified from each clustering method and the sample groups from Assignment
      1. You should have n+1 annotation columns, 1 per clustering method and 1 for the sample groups
        in Assignment 1.
      2. Either create a different heatmap for each method or include all in the same heatmap.
      3. Include a legend and axis labels
      4. Include row and column dendrograms

#3 Gene Heatmap

# Placeholder
mat_100<- data.matrix(top_1000,rownames.force = FALSE)
ht = ComplexHeatmap::Heatmap(mat_100,heatmap_legend_param = list(
        title = "Expression", at = c(0, 50, 100,150)
    ),column_title = "Clustered Genes",show_column_names = FALSE,cluster_columns = TRUE,cluster_rows = TRUE)
ComplexHeatmap::draw(ht)

#
# annotation_data <- data.frame(
#   kmeans100four = as.factor(kmeans_result_100_four$cluster),
#   kmeans1000five = as.factor(kmeans_result_1000_five$cluster),
#   ReferenceGroup = as.factor(reference_group)
# )
# 
# pheatmap(
#   expression_df,
#   annotation_col = annotation_data,
#   show_rownames = FALSE,
#   show_colnames = FALSE
# )

To generate the above heatmap, we utilized the ComplexHeatmap library again using the top 5,000 genes as instructed. We provided a legend based on gene expression, and to cluster the genes within the heatmap, we set the cluster_columns and cluster_rows parameters to “TRUE”. This also displays the dendrograms for the rows and columns. Unfortunately, we were unable to cluster the heatmap according to the clusters we determined in earlier sections of the assignment, and so were unable to annotate sections for said columns. While clusters data do clearly exist, the heatmap unfortunately does not clearly reflect this fact.

  1. Statistics

    1. Does cluster membership correlate with the groups you chose in Assignment 1? Perform a chi-squared test of independence to statistically compare the two. Do this for each clustering result you identified – include all versions per clustering approach. You are comparing your different clustering results.
    2. Adjust all statistical test results for multiple hypothesis testing (p.adjust).
    3. Create a table with statistical test results that includes adjusted and un-adjusted p-values.
    # Alright, let's prep the metadata to better select what we need.
    clusterResults <- list(
        "KMEANS_K5_N10" = kmeans_result_10_five$cluster,
        "KMEANS_K4_N10" = kmeans_result_10_four$cluster,
        "KMEANS_K3_N10" = kmeans_result_10_three$cluster,
        "KMEANS_K5_N100" = kmeans_result_100_five$cluster,
        "KMEANS_K4_N100" = kmeans_result_100_four$cluster,
        "KMEANS_K3_N100" = kmeans_result_100_three$cluster,
        "KMEANS_K5_N1000" = kmeans_result_1000_five$cluster,
        "KMEANS_K4_N1000" = kmeans_result_1000_four$cluster,
        "KMEANS_K3_N1000" = kmeans_result_1000_three$cluster,
        "KMEANS_K5_N10000" = kmeans_result_10000_five$cluster,
        "KMEANS_K4_N10000" = kmeans_result_10000_four$cluster,
        "KMEANS_K3_N10000" = kmeans_result_10000_three$cluster,
        "KMEANS_K5_N5000" = kmeans_result_5000_five$cluster,
        "KMEANS_K4_N5000" = kmeans_result_5000_four$cluster,
        "KMEANS_K3_N5000" = kmeans_result_5000_three$cluster,
        "HCLUSTER_K5_N100" = hcluster_result_100_five,
        "HCLUSTER_K4_N100" = hcluster_result_100_four,
        "HCLUSTER_K3_N100" = hcluster_result_100_three,
        "HCLUSTER_K2_N100" = hcluster_result_100_two,
        "HCLUSTER_K5_N10000" = hcluster_result_10000_five,
        "HCLUSTER_K4_N10000" = hcluster_result_10000_four,
        "HCLUSTER_K3_N10000" = hcluster_result_10000_three,
        "HCLUSTER_K2_N10000" = hcluster_result_10000_two,
        "HCLUSTER_K4_N5000" = hcluster_result_5000_four,
        "HCLUSTER_K3_N5000" = hcluster_result_5000_three,
        "HCLUSTER_K2_N5000" = hcluster_result_5000_two,
        "PAM_K3_N10" = pam_result_10_three$clustering,
        "PAM_K3_N100" = pam_result_100_three$clustering,
        "PAM_K3_N1000" = pam_result_1000_three$clustering,
        "PAM_K3_N10000" = pam_result_10000_three$clustering,
        "PAM_K3_N5000" = pam_result_5000_three$clustering,
        "PAM_K4_N5000" = pam_result_5000_four$clustering,
        "PAM_K5_N5000" = pam_result_5000_five$clustering,
        "PAM_K6_N5000" = pam_result_5000_six$clustering
    )

    Now we need to trim our “culledMeta” to only the columns we care about: accession code, and diabetic vs nondiabetic.

    #
    referenceGroup <- subset(culledMeta, select = c("refinebio_accession_code", "diabetes"))
    referenceGroup <- referenceGroup %>%
      dplyr::mutate(diabetes = dplyr::case_when(
        stringr::str_detect(diabetes, "reference") ~ 0,
        stringr::str_detect(diabetes, "diabetic") ~ 1,
      ))
    
    reference_group <- referenceGroup$diabetes
    results_df <- data.frame(
      Name = character(),
      X_Squared = numeric(),
      Degrees_of_Freedom = integer(),
      P_Value_Adjusted = numeric(),
      stringsAsFactors = FALSE
    )
    
    # Loop through each matrix
    for(name in names(clusterResults)) {
      test_result <- chisq.test(clusterResults[[name]])
      x_squared <- ifelse(is.null(test_result$statistic), 0, test_result$statistic)
      df <- ifelse(is.null(test_result$parameter), 0, test_result$parameter)
      p_value <- ifelse(is.null(test_result$p.value), -1, p.adjust(test_result$p.value))
    
      temp_df <- data.frame(
        Name = name,
        X_Squared = x_squared,
        Degrees_of_Freedom = df,
        P_Value_Adjusted = p_value
      )
    
      results_df <- rbind(results_df, temp_df)
    }
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    
    ## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
    ## incorrect
    print(results_df)
    ##                  Name X_Squared Degrees_of_Freedom P_Value_Adjusted
    ## 1       KMEANS_K5_N10  765.8131               1052     1.0000000000
    ## 2       KMEANS_K4_N10  428.9811               1052     1.0000000000
    ## 3       KMEANS_K3_N10  379.2207               1052     1.0000000000
    ## 4      KMEANS_K5_N100  846.5928               1052     0.9999991795
    ## 5      KMEANS_K4_N100  491.4929               1052     1.0000000000
    ## 6      KMEANS_K3_N100  396.9655               1052     1.0000000000
    ## 7     KMEANS_K5_N1000  668.6595               1052     1.0000000000
    ## 8     KMEANS_K4_N1000  448.9761               1052     1.0000000000
    ## 9     KMEANS_K3_N1000  243.6913               1052     1.0000000000
    ## 10   KMEANS_K5_N10000  494.6049               1052     1.0000000000
    ## 11   KMEANS_K4_N10000  484.4074               1052     1.0000000000
    ## 12   KMEANS_K3_N10000  344.3206               1052     1.0000000000
    ## 13    KMEANS_K5_N5000  706.0422               1052     1.0000000000
    ## 14    KMEANS_K4_N5000  573.0656               1052     1.0000000000
    ## 15    KMEANS_K3_N5000  254.3345               1052     1.0000000000
    ## 16   HCLUSTER_K5_N100 1123.0592               1052     0.0630019421
    ## 17   HCLUSTER_K4_N100  731.8579               1052     1.0000000000
    ## 18   HCLUSTER_K3_N100  461.2623               1052     1.0000000000
    ## 19   HCLUSTER_K2_N100  180.6631               1052     1.0000000000
    ## 20 HCLUSTER_K5_N10000  737.5934               1052     1.0000000000
    ## 21 HCLUSTER_K4_N10000  163.7260               1052     1.0000000000
    ## 22 HCLUSTER_K3_N10000  157.7638               1052     1.0000000000
    ## 23 HCLUSTER_K2_N10000  154.7510               1052     1.0000000000
    ## 24  HCLUSTER_K4_N5000  502.7748               1052     1.0000000000
    ## 25  HCLUSTER_K3_N5000  180.3102               1052     1.0000000000
    ## 26  HCLUSTER_K2_N5000  178.6978               1052     1.0000000000
    ## 27         PAM_K3_N10  368.1386               1052     1.0000000000
    ## 28        PAM_K3_N100  383.8800               1052     1.0000000000
    ## 29       PAM_K3_N1000  346.1091               1052     1.0000000000
    ## 30      PAM_K3_N10000  301.7892               1052     1.0000000000
    ## 31       PAM_K3_N5000  399.6943               1052     1.0000000000
    ## 32       PAM_K4_N5000  683.3583               1052     1.0000000000
    ## 33       PAM_K5_N5000  165.4383               1052     1.0000000000
    ## 34       PAM_K6_N5000 1199.9504               1052     0.0009677833
    for (name in names(clusterResults)) {
        tbl <- table(clusterResults[[name]], reference_group)
        test <- chisq.test(tbl)
        test$p.value <- p.adjust(test$p.value)
        cat("Result for", name, ":\n")
        print(test)
        cat("\n-------------------------------\n")
    }
    ## Result for KMEANS_K5_N10 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 36.982, df = 4, p-value = 1.817e-07
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K4_N10 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 7.4321, df = 3, p-value = 0.05933
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K3_N10 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 4.963, df = 2, p-value = 0.08362
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K5_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 157.28, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K4_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 159.33, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K3_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 145.22, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K5_N1000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 157.64, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K4_N1000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 165.35, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K3_N1000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 158.39, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K5_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 244.09, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K4_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 214.61, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K3_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 168.87, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for KMEANS_K5_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 227.76, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K4_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 181.93, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for KMEANS_K3_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 182.79, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K5_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 98.745, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for HCLUSTER_K4_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 97.41, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for HCLUSTER_K3_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 73.327, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for HCLUSTER_K2_N100 :
    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  tbl
    ## X-squared = 2.5675, df = 1, p-value = 0.1091
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K5_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 91.018, df = 4, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K4_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 7.3542, df = 3, p-value = 0.06143
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K3_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 7.2395, df = 2, p-value = 0.02679
    ## 
    ## 
    ## -------------------------------
    ## Result for HCLUSTER_K2_N10000 :
    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  tbl
    ## X-squared = 5.149, df = 1, p-value = 0.02326
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K4_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 121.88, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
    ## Result for HCLUSTER_K3_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 0.14406, df = 2, p-value = 0.9305
    ## 
    ## 
    ## -------------------------------
    ## Result for HCLUSTER_K2_N5000 :
    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  tbl
    ## X-squared = 0.0049344, df = 1, p-value = 0.944
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K3_N10 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 16.229, df = 2, p-value = 0.0002992
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K3_N100 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 91.235, df = 2, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K3_N1000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 5.1362, df = 2, p-value = 0.07668
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K3_N10000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 22.944, df = 2, p-value = 1.042e-05
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K3_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 59.954, df = 2, p-value = 9.576e-14
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K4_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 91.076, df = 3, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K5_N5000 :
    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  tbl
    ## X-squared = 34.378, df = 1, p-value = 4.538e-09
    ## 
    ## 
    ## -------------------------------
    ## Result for PAM_K6_N5000 :
    ## 
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tbl
    ## X-squared = 90.859, df = 5, p-value < 2.2e-16
    ## 
    ## 
    ## -------------------------------

    These P values are pretty atrocious. It is safe to assume that the groups chosen are indeed independent. This may have to do with the fact that we chose the most variably expressed genes, not the most expressed genes period, which generally picks the genes noisiest accross our samples and thus yields relatively poor clusters (as seen by the oddly low amount of clusters chosen by sillhouette method or the lack of an elbow in the WSS kmeans evaluation).

  2. Write a short summary for each plot/table you created in this assignment. In 3-5 sentences for each, describe what you did, what parameters you used (if any) and an interesting result from it.

  3. As a team, fill out the team evaluation table below.

    Team table

  4. Combine all results into a single file, submit on Canvas. Make sure that all your code is added to your GitHub repository.